


(57) A set of predicted readers is determined for a data block
subject to a write request in a shared-memory multiprocessor
system by first determining a current set of readers of the data
block, and then generating the set of predicted readers based on
the current set of readers and at least one additional set of
readers representative of at least a portion of a global history
of a directory associated with the data block. In one possible
implementation, the set of predicted readers is generated by
applying a function to the current set of readers and one or more
additional sets of readers. The function may be, for example, a
union function, an intersection function or a pattern-based
function, and the directory and data block may be elements of a
memory associated with a particular processor node of the
multiprocessor system. The global history of the directory
comprises multiple sets of previous readers processed by the
directory, with the total number of sets of previous readers
corresponding to a designated history depth associated with
generation of the set of predicted readers. The prediction
process may use additional information in conjunction with the
directory information, such as a designated subset of cache
address information, processor node identification information,
or program counter information.



FIG. 1

[Figure: distributed shared-memory multiprocessor system with
nodes A, B and C. Each node contains a processor (106A, 106B,
106C) with a cache (L1, L2), a memory (108A, 108B, 108C) holding
the directory and L3, and a network interface (NI 104A, 104B,
104C). The nodes are connected by an interconnection network.]



EP 1 162 542 A1



Description 

Field of the invention 

[0001] The present invention relates generally to mul- 
tiprocessor computers and other types of processing 
systems which include multiple processors, and more 
particularly to memory prediction techniques suitable for 
use in such systems. 

Background of the Invention 

[0002] In a shared-memory multiprocessor system, it 
appears to a user that all processors read and modify 
state in a single shared memory store. A substantial dif- 
ficulty in implementing such a system, and particularly 
a distributed version of such a system, is propagating 
values from one processor to another, in that the actual 
values are created close to one processor but might be 
used by many other processors in the system. If the im- 
plementation could accurately predict the sharing pat- 
terns of a given program, the processor nodes of a dis- 
tributed multiprocessor system could spend more of 
their time computing and less of their time waiting for 
values to be fetched from remote locations. Despite the 
development of processor features such as non-block- 
ing caches and out-of-order instruction execution, the 
relatively long access latency in a distributed shared- 
memory system remains a serious impediment to per- 
formance. 

[0003] Prediction techniques have been used to re- 
duce access latency in distributed shared-memory sys- 
tems by attempting to move data from their creation 
point to their expected use points as early as possible. 
These prediction techniques typically supplement the 
normal shared-memory coherence protocol, which is 
concerned primarily with correct operation and second- 
arily with performance. In a distributed shared-memory 
system, the coherence protocol, which is typically direc- 
tory-based, keeps processor caches coherent and 
transfers data among the processor nodes. In essence, 
the coherence protocol carries out all communication in 
the system. Coherence protocols can either invalidate 
or update shared copies of a data block whenever the 
data block is written. Updating involves forwarding data 
from producer nodes to consumer nodes but does not 
provide a feedback mechanism to determine the useful- 
ness of data forwarding. Invalidation provides a natural 
feedback mechanism, in that invalidated readers must 
have used the data, but invalidation provides no means 
to forward data to its destination. 
[0004] A conventional prediction approach described in S.S.
Mukherjee and M.D. Hill, "Using Prediction to Accelerate
Coherence Protocols," Proceedings of the 25th Annual
International Symposium on Computer Architecture (ISCA),
June-July 1998, uses address-based 2-level predictors at the
directories and caches of the processor nodes of a multiprocessor
system to track and predict coherence messages. A. Lai and B.
Falsafi, "Memory Sharing Predictor: The Key to a Speculative
Coherent DSM," Proceedings of the 26th Annual ISCA, May 1999,
describe how these 2-level predictors can be modified to use less
space, by coalescing messages from different nodes into bitmaps,
and show how the modified predictors can be used to accelerate
reading of data. Another set of known prediction techniques,
described in S. Kaxiras and J.R. Goodman, "Improving CC-NUMA
Performance Using Instruction-Based Prediction," Proceedings of
the 5th Annual IEEE Symposium on High-Performance Computer
Architecture (HPCA), January 1999, provides instruction-based
prediction for migratory sharing, wide sharing and
producer-consumer sharing. Since static instructions are far
fewer than data blocks, instruction-based predictors require less
space to capture sharing patterns.
[0005] Despite the advances provided by the above-identified
prediction techniques, a need remains for additional
improvements, so as to further reduce access latency and thereby
facilitate the implementation of shared-memory multiprocessor
systems.

Summary of the Invention 

[0006] The invention provides improved techniques for determining
a set of predicted readers of a data block subject to a write
request in a shared-memory multiprocessor system. In accordance
with an aspect of the invention, a current set of readers of the
data block is determined, and then the set of predicted readers
is generated based on the current set of readers and at least one
additional set of readers representative of at least a portion of
a global history of a directory associated with the data block.
In one possible implementation, the set of predicted readers is
generated by applying a function to the current set of readers
and one or more additional sets of readers. The function may be,
for example, a union function, an intersection function or a
pattern-based function, and the directory and data block may be
elements of a memory associated with a particular processor node
of the multiprocessor system.
[0007] The global history of the directory comprises multiple
sets of previous readers processed by the directory, with the
total number of sets of previous readers corresponding to a
designated history depth associated with generation of the set of
predicted readers. The global history may be maintained, for
example, in a shift register having a number of storage locations
corresponding to the designated history depth. The history depth
is preferably selected as a value greater than two, such as four.
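
The shift-register behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the claimed hardware: the class name `GlobalHistory`, its methods, and the use of single-letter node names are assumptions introduced here. A bounded deque stands in for the shift register, so shifting in a new set of readers drops the oldest one once the history depth is reached.

```python
from collections import deque


class GlobalHistory:
    """Illustrative model of a directory's global history (hypothetical name).

    Holds the most recent `depth` sets of readers, newest first.
    """

    def __init__(self, depth=4):
        # A deque with maxlen behaves like a fixed-length shift register.
        self._slots = deque(maxlen=depth)

    def shift_in(self, readers):
        # appendleft on a full deque discards the entry at the other end,
        # i.e. the oldest set of previous readers.
        self._slots.appendleft(frozenset(readers))

    def slots(self):
        # Slot 0 is the most recent set of readers.
        return list(self._slots)


history = GlobalHistory(depth=4)
# Each string is shorthand for a set of single-letter node names.
for readers in ("klm", "ahic", "acd", "acefg", "abc"):
    history.shift_in(readers)

for slot, readers in enumerate(history.slots()):
    print(slot, sorted(readers))
```

With a depth of four, the fifth set shifted in displaces the first, so the history always reflects the four most recent sets of readers seen by the directory, regardless of which block they belonged to.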

[0008] In operation, the directory or other processor node
element associated with the data block subject to the write
request sends an invalidation request to each of the readers in
the current set of readers, and upon receipt of an invalidation
acknowledgment from each of the readers in the current set of
readers, sends a valid copy of the data block to a writer which
generated the write request. Each reader in the system may
maintain an accessed bit for each of a number of data blocks, the
accessed bit of a particular reader for a given data block
indicating whether the particular reader has actually read the
given data block. The accessed bit information may be sent by the
particular reader to the directory in conjunction with an
invalidation acknowledgment. After the requested write on the
data block is completed, the resulting data block is sent to each
of the readers in the set of predicted readers.

[0009] In accordance with another aspect of the in- 
vention, the above-noted function may be selected dy- 
namically. For example, the function may be selected 
on a per-program basis, such that each of a number of 
programs running in the multiprocessor system can in- 
dependently determine the function to be applied to de- 
termine the set of predicted readers. As another exam- 
ple, the function may be selected under program control 
at run time by a given program running on the multiproc- 
essor system. As a further example, the function may 
be selected on a per-page basis, such that the function 
applied can be determined independently for each of a 
number of memory pages, each of which may include 
multiple data blocks. As yet another example, the func- 
tion may be selected based at least in part on informa- 
tion regarding network utilization. Various combinations 
of these and other types of information may also be used 
in the dynamic selection of the above-noted function. 
[0010] The prediction process in accordance with the
present invention may use additional information in con- 
junction with the above-described directory information, 
such as a designated subset of cache address informa- 
tion, processor node identification information, or pro- 
gram counter information. 

[0011] Advantageously, the prediction techniques of 
the present invention provide improved prediction accu- 
racy, both in terms of fewer false positives and fewer 
false negatives, relative to conventional techniques. 
[0012] These and other features and advantages of 
the present invention will become more apparent from 
the accompanying drawings and the following detailed 
description. 

Brief Description of the Drawings 
[0013] 

FIGS. 1 and 2 illustrate the operation of a distributed
shared-memory multiprocessor system in which a
directory-based predictor may be implemented in
accordance with the present invention.
FIG. 3 shows an example of a sequence of events
and an aggregation of readers.
FIG. 4 shows an example of directory-based
prediction in accordance with the invention.
FIG. 5 is a flow diagram of a directory-based
prediction process in accordance with the invention.
FIG. 6 shows a set of tables listing examples of
predictors in accordance with the invention.

Detailed Description of the Invention

[0014] The invention will be illustrated herein in conjunction
with exemplary distributed shared-memory multiprocessor systems.
It should be understood, however, that the invention is more
generally applicable to any shared-memory multiprocessor system
in which it is desirable to provide improved performance through
the use of directory-based prediction. The term "multiprocessor
system" as used herein is intended to include any device in which
retrieved instructions are executed using one or more processors.
Exemplary processors in accordance with the invention may
include, for example, microprocessors, central processing units
(CPUs), very long instruction word (VLIW) processors,
single-issue processors, multi-issue processors, digital signal
processors, application-specific integrated circuits (ASICs),
personal computers, mainframe computers, network computers,
workstations and servers, and other types of data processing
devices, as well as portions and combinations of these and other
devices.

[0015] FIGS. 1 and 2 illustrate the handling of example read and
write requests, respectively, in a distributed shared-memory
multiprocessor system 100. The system 100 is an example of one
type of system in which the directory-based prediction of the
present invention may be implemented. The system 100 includes
nodes A, B and C, which are connected to an interconnection
network 102 via corresponding network interfaces (NIs) 104A, 104B
and 104C, respectively. The nodes A, B and C include processors
106A, 106B and 106C, memories 108A, 108B and 108C, and buses
110A, 110B and 110C, respectively, arranged as shown. Within a
given node i of the system 100, i = A, B, C, the processor 106i,
memory 108i and network interface 104i are each coupled to and
communicate over the corresponding bus 110i.

[0016] Associated with each of the processors 106i in the system
100 is a set of caches L1 and L2, and associated with each of the
memories 108i is a directory and a cache L3. Each of the memories
108i is managed by its corresponding unique directory. The
memories 108i or portions thereof are referred to herein as data
blocks or simply blocks. Although there are multiple directories
in the system 100, each block in this illustrative embodiment is
managed by just one of them. If a would-be reader or would-be
writer does not have an up-to-date copy of a block, it asks the
corresponding directory to find the most recent version of the
block. The directory may have to invalidate one or more current
copies of the block in order to answer a request.

[0017] Also illustrated in FIG. 1 is an example of a read
operation in which processor 106A of node A reads data from
memory 108B of node B. As part of this operation, read request
(1) goes from node A to node B, and reply (2) returns from node B
to node A. Node A caches the data in its local cache hierarchy,
i.e., caches L1, L2 and L3. The directory in node B stores an
indication that node A has a copy of the data. Other nodes read
data from node B in the same manner.
[0018] It should be noted that the terms "reader" and 
"writer" as used herein are intended to include without 
limitation a given processor node or its associated proc- 
essor, as well as elements or portions of a processor 
node or its associated processor. 
[0019] FIG. 2 shows a write operation in which proc- 
essor 106C of node C writes the same data residing in 
memory 108B of node B. As part of this operation, write 
request (1) goes from node C to node B. Because the
directory in node B knows that node A has a copy of the 
data, it sends an invalidation request (2) to node A. Node 
A sends an acknowledgment (3) of the invalidation of its 
copy of the data. Node B then sends the data (4) to node 
C for writing since there are no more copies in the sys- 
tem. 
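
The read of FIG. 1 and the write of FIG. 2 can be traced with a small model of an invalidation-based directory. This is a minimal sketch under stated assumptions: the `Directory` class, its method names, and the tuple-based message log are hypothetical and exist only to show the order of messages (1) through (4); they are not part of the described system.

```python
class Directory:
    """Toy directory that tracks which nodes hold a copy of each block."""

    def __init__(self, home):
        self.home = home          # node whose memory holds the blocks
        self.sharers = {}         # block -> set of nodes with a copy

    def read(self, block, reader):
        # Record that the reader now has a copy of the block.
        self.sharers.setdefault(block, set()).add(reader)

    def write(self, block, writer):
        # (1) the write request arrives from the writer
        log = [("write_request", writer, self.home)]
        for node in sorted(self.sharers.get(block, set()) - {writer}):
            # (2) invalidation request, (3) acknowledgment
            log.append(("invalidate", self.home, node))
            log.append(("ack", node, self.home))
        # (4) no other copies remain, so the data goes to the writer
        log.append(("data", self.home, writer))
        self.sharers[block] = {writer}
        return log


directory_b = Directory(home="B")
directory_b.read("X", "A")              # FIG. 1: node A reads from node B
messages = directory_b.write("X", "C")  # FIG. 2: node C writes the same data
for message in messages:
    print(message)
```

The model serializes invalidations for readability; a real directory would typically send them in parallel and wait for all acknowledgments before step (4).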

[0020] A given memory block in the system 100 thus 
may be viewed as alternating between phases of being 
written by a single processor and being read by one or 
more processors. The directory associated with the giv- 
en block manages these alternating phases, maintain- 
ing a consistent version of the block at all times. 
[0021] It should be noted that the particular arrange- 
ments of caches and cache levels shown in FIGS. 1 and 
2 are examples only, and should not be construed as 
limiting the scope of the present invention in any way. 
The invention may be implemented using a wide variety 
of different cache architectures or multiprocessor sys- 
tem configurations. 

[0022] FIG. 3 illustrates an example of these alternat- 
ing phases for a single block in a system having five 
nodes denoted 1, 2, 3, 4 and 5. The left side of the figure
shows the raw sequence of read and write events relat- 
ing to the single block, and the right side of the figure 
shows a summary of the above-noted phases. As is ap- 
parent from this example, it is generally safe for multiple 
readers to examine the most-recently-generated ver- 
sion of a block. 

[0023] The present invention in an illustrative embod- 
iment provides a directory-based prediction mechanism 
which predicts the next set of readers of a block when 
a write request goes from the writer to the directory as- 
sociated with the block. The mechanism predicts a likely 
set of readers of the value produced by the writer, and 
after the writer has finished writing, this prediction is 
used to forward the data to all the predicted readers. 
Unlike conventional predictors which distinguish among 
blocks or among instructions to keep separate histories 
for blocks in the system, the prediction mechanism of 
the present invention merges together multiple sets of 
readers for multiple blocks served by the directory. This 
information is referred to herein as the global history of 
the directory. 

[0024] In the example implementation of the illustrative
embodiment to be described in conjunction with FIG. 4 below, a
history depth of four is used, i.e., the predicted set of readers
generated for a current write operation on a given block is
determined as a function of the current set of readers of that
block and the three other most recent sets of readers stored in a
predictor shift register.

[0025] FIG. 4 shows an example of the operation of a
directory-based predictor in the illustrative embodiment of the
invention. In this example, a write request is received for a
data block X associated with a memory and directory 120. The
current readers of the data block X are a set of nodes {a, b, c}
of a multiprocessor system which includes nodes denoted a, b, c,
d, e, f, g, h, i, j, k, l, m, etc. Each of the nodes may
represent a node of a multiprocessor system such as that
illustrated in conjunction with FIGS. 1 and 2. The predictor in
this example uses a shift register 122 in a manner to be
described below.

[0026] FIG. 5 shows a flow diagram of the general processing
operations of the directory-based predictor of the FIG. 4
example. The general operations will first be described with
reference to FIG. 5, and then the application of the general
operations to the FIG. 4 example will be described in detail.

[0027] In step 200 of FIG. 5, a writer sends a write request
message to the directory of the block to be written. The
directory in step 202 invalidates the current readers. Steps 204
and 206 are then performed by each of the readers. In step 204, a
given node corresponding to a potential reader receives the
invalidation from the directory. In step 206, the node returns an
"accessed bit" with an acknowledgment of the invalidation.
[0028] As indicated in step 208, the directory waits for
invalidation acknowledgments from the readers. A set of true
readers is determined as the set of invalidated nodes for which
the returned accessed bit is set. The directory in step 210
provides information identifying the set of true readers to the
predictor.
[0029] The predictor then adds the set of true readers to its
shift register (step 212), discards the oldest set of previous
readers in the shift register (step 214), generates a prediction
as a function of the sets using an intersection or union
operation (step 216), and sends the prediction to the writer
(step 218).
[0030] The directory in step 220 sends a valid copy of the block
to the writer. This copy may be sent to the writer along with the
prediction of step 218. In step 222, time passes until the writer
finishes the write operation. After the write operation is
completed, the writer uses the information in the prediction to
forward the new block to each predicted reader, as shown in step
224.
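
Steps 208 through 216 of FIG. 5 can be sketched as a single function. This is a hedged illustration rather than the patented implementation: the function name `predict_readers`, the dictionary of acknowledgments, and the use of Python frozensets for reader sets are all assumptions made for this example.

```python
from collections import deque


def predict_readers(history, acks, combine):
    """Update the global history from invalidation acks and return a prediction.

    history: deque(maxlen=depth) of frozensets, newest first
    acks:    mapping of invalidated node -> accessed bit returned with its ack
    combine: frozenset.union or frozenset.intersection
    """
    # Steps 208-210: true readers are invalidated nodes whose accessed bit is set.
    true_readers = frozenset(node for node, accessed in acks.items() if accessed)
    # Steps 212-214: shift the new set in; the bounded deque drops the oldest set.
    history.appendleft(true_readers)
    # Step 216: apply the chosen function across all sets in the history.
    return combine(*history)


# A depth-2 history that already holds one earlier set of readers.
history = deque([frozenset({"a", "b"})], maxlen=2)
# Node b received a copy but never read it, so its accessed bit is clear.
acks = {"a": True, "b": False, "c": True}

prediction = predict_readers(history, acks, frozenset.intersection)
print(sorted(prediction))
```

Passing `frozenset.union` instead of `frozenset.intersection` gives the more aggressive prediction over the same history.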
[0031] Suitable techniques for determining an appropriate time
for forwarding a new block to each predicted reader are described
in, e.g., S. Kaxiras, "Identification and Optimization of Sharing
Patterns for Scalable Shared-Memory Multiprocessors," Ph.D.
Thesis, Computer Sciences, University of Wisconsin-Madison, 1998,
and in the above-cited reference A. Lai and B. Falsafi, "Memory
Sharing Predictor: The Key to a Speculative Coherent DSM,"
Proceedings of the 26th Annual ISCA, May 1999, both of which are
incorporated by reference herein.
[0032] The choice of union or intersection function in step 216
of FIG. 5 generally depends on the desired degree of
aggressiveness in the data forwarding. For example, in
high-bandwidth systems, the more aggressive data forwarding
associated with the union function may be more appropriate, while
for low-bandwidth systems, the intersection function may be more
appropriate. It should be noted that these functions are given by
way of example only, and the invention can be implemented using
other types of functions. As another example, pattern-based
functions can be used in conjunction with the present invention.
Such functions are described in greater detail in, e.g., T. Yeh
and Y. Patt, "Two-Level Adaptive Branch Prediction," Proceedings
of the 24th Annual ACM/IEEE International Symposium and Workshop
on Microarchitecture, Los Alamitos, CA, November 1991, which is
incorporated by reference herein.
[0033] The choice between union, intersection or other
functions in step 216 can be made on a dynamic basis.
For example, the choice may be made on a per-
program basis, such that each program can set its own
mode of operation. As another example, the choice of 
function may be implemented under program control, 
such that programs can change the mode of operation 
at run-time according to their needs. The choice could 
alternatively be made on a per-page basis, with each 
memory page, which may be comprised of multiple data 
blocks, having its own mode. In this case, an operating 
system may notify the predictor about the mode of op- 
eration of various pages. As yet another example, the 
choice of function could be made in accordance with 
network utilization, e.g., with low network utilization call- 
ing for the union function, and high network utilization 
calling for the intersection function. In this case, a net- 
work monitor device may be used to supply feedback to 
the prediction mechanism. 
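
As one way to picture the network-utilization case, the sketch below picks the function from a utilization reading. The function name `choose_function`, the normalized utilization value between 0 and 1, and the 0.5 threshold are assumptions introduced for illustration; the text does not specify how the network monitor reports utilization or where the cutoff lies.

```python
def choose_function(network_utilization, threshold=0.5):
    """Pick the set-combining function from a utilization reading.

    network_utilization: assumed to be a fraction in [0, 1] supplied by a
    network monitor; threshold is an arbitrary illustrative cutoff.
    """
    if network_utilization < threshold:
        # Spare bandwidth: forward aggressively to every recent reader.
        return frozenset.union
    # Busy network: forward only to readers present in every recent set.
    return frozenset.intersection


recent = [frozenset({"a", "b", "c"}), frozenset({"a", "c", "d"})]

quiet = choose_function(0.1)(*recent)
busy = choose_function(0.9)(*recent)
print(sorted(quiet), sorted(busy))
```

A per-program or per-page mode could be modeled the same way by looking the function up in a table keyed by program or page identifier instead of computing it from utilization.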

[0034] Referring again to the example of FIG. 4, when the write
request for data block X is received, the current readers are the
processors in the set {a, b, c}. The predictor shift register 122
is shifted by one as shown, and the set {a, b, c} is installed in
the top slot, denoted slot 0. As a result of the shift, slots 1,
2 and 3 contain the sets {a, c, e, f, g}, {a, c, d} and
{a, h, i, c}, respectively, and the set {k, l, m} is dropped from
the shift register. The current contents of the shift register
122 at a given point in time represent the global history of the
corresponding directory. The directory invalidates the current
readers by sending invalidation requests to the nodes a, b and c,
waits for acknowledgment of the invalidation, and later sends a
valid copy of data block X to the requesting writer.

[0035] The predictor determines the union or the intersection of
the sets in the shift register 122, in accordance with step 216
of FIG. 5, choosing union or intersection based on one or more of
the factors described above. The union of the sets in the shift
register is the set {a, b, c, d, e, f, g, h, i}, while the
intersection of the sets in the shift register is the set {a, c}.
In either case, the result is a set of predicted readers which is
sent by the predictor to the writer. After the write operation on
data block X is completed, the writer forwards the new block to
each of the predicted readers. The triggering of the data
forwarding can be based on a timer, or by the next write to the
directory regardless of which data block is written, or by the
next read to data block X, or by other suitable techniques. The
data forwarding may be implemented by the directory fetching a
copy of the data from the writer and sending it to the predicted
reader nodes.
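
The union and intersection quoted for the FIG. 4 example can be checked directly. The snippet below is only a worked verification using Python sets; the variable names are chosen here for illustration.

```python
from functools import reduce

# Contents of shift register 122 after the shift described for FIG. 4.
shift_register = [
    {"a", "b", "c"},            # slot 0: current readers of block X
    {"a", "c", "e", "f", "g"},  # slot 1
    {"a", "c", "d"},            # slot 2
    {"a", "h", "i", "c"},       # slot 3
]

union_prediction = reduce(set.union, shift_register)
intersection_prediction = reduce(set.intersection, shift_register)

print(sorted(union_prediction))         # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
print(sorted(intersection_prediction))  # ['a', 'c']
```

The union forwards the new block to nine nodes, while the intersection forwards it only to the two nodes that appear in every recent set.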
[0036] It should be noted that, although the predictor 
in the FIG. 4 example uses a history depth of four, i.e., 
the shift register 122 stores the four most recent sets of 
readers for a given data block, the present invention can 
be implemented using other history depths, including 
history depths greater or less than four. Conventional 
predictors generally utilize a history depth no greater 
than two. 

[0037] In order to provide accurate feedback to the 
above-described prediction mechanism, each reader 
generally must be able to distinguish between a predict- 
ed read and an actual read. When a writer gains exclu- 
sive access to a cache block, a multiprocessor system 
in accordance with the invention predicts the future set 
of readers of the block, then ensures that copies of the 
block are forwarded to those predicted readers after the 
write has completed. In order to close the feedback loop, 
the system must find out how many of those predicted 
readers actually used the block. To tell whether this is 
the case, each reader in the system may maintain the 
above-noted "accessed bit" for each locally cached line. 
This accessed bit is similar to the so-called "dirty bit" 
kept for paging in a virtual memory system, except that 
the accessed bit will be set when a cache block is read, 
rather than when it is written. Also, the accessed bit 
should be maintained at cache block granularity, while 
dirty bits are typically maintained at page granularity. At 
the next invalidation, each reader piggybacks the ac- 
cessed bit information on the invalidation acknowledg- 
ment. The system can then use the accessed bits to up- 
date its state for the next prediction. 
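
The reader-side bookkeeping can be sketched as follows. The classes `CacheLine` and `ReaderCache` and their methods are hypothetical names for this illustration, and the sketch assumes the accessed bit is cleared whenever a forwarded copy is installed, which is what lets a predicted read be told apart from an actual read.

```python
class CacheLine:
    """One cached block with a per-line accessed bit."""

    def __init__(self, data):
        self.data = data
        self.accessed = False  # clear until the local processor reads the line


class ReaderCache:
    """Toy reader cache that reports accessed bits on invalidation."""

    def __init__(self):
        self.lines = {}

    def install_forwarded(self, block, data):
        # A forwarded (predicted) copy arrives; it has not been used yet.
        self.lines[block] = CacheLine(data)

    def read(self, block):
        line = self.lines[block]
        line.accessed = True   # set on an actual read, not on a write
        return line.data

    def invalidate(self, block):
        # Drop the line and return the accessed bit, to be piggybacked
        # on the invalidation acknowledgment.
        line = self.lines.pop(block, None)
        return line.accessed if line is not None else False


cache = ReaderCache()

cache.install_forwarded("X", data=1)
unused_copy = cache.invalidate("X")   # forwarded but never read

cache.install_forwarded("X", data=2)
cache.read("X")
used_copy = cache.invalidate("X")     # forwarded and actually read

print(unused_copy, used_copy)
```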
[0038] It should be noted that processing operations 
referred to herein as being performed or otherwise im- 
plemented by a directory may be performed or other- 
wise implemented by an associated element of a proc- 
essor node, such as by a processor under program con- 
trol. 

[0039] In alternative embodiments of the invention, 
the directory information can be supplemented by other 
information in order to provide further performance im- 
provements. For example, the directory information can 
be supplemented with a designated subset of cache- 
block-address information. Advantageously, such an
arrangement uses less information than conventional
address-based prediction techniques, while also achieving
a higher prediction accuracy. In other embodiments of 
the invention, the directory information with or without 
the subset of the cache address information can be 
combined with other types of information, such as proc- 
essor node and program counter information. For exam- 
ple, the invention may utilize various combinations of di- 
rectory, address, processor node and program counter 
information in the process of determining a set of pre- 
dicted readers for a given write request. 
[0040] As the term is used herein, the "global history" 
of a directory is intended to include not only a history 
based on directory information alone, but also a history 
which includes directory information as well as an 
amount of additional information, such as address, proc- 
essor node or program counter information, which is 
less than the full available amount of such additional in- 
formation. For example, a global history may include di- 
rectory information supplemented with a small number 
of address bits, i.e., an amount of address bits less than 
a full set of available address bits. 
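
A global history supplemented with a few address bits can be pictured as a table of histories keyed by directory and a small slice of the block address. In the sketch below, the function name `predictor_index`, the choice of the low-order block-address bits, and the 64-byte block size (six offset bits) are all assumptions made for illustration; the text does not say which address bits are used.

```python
def predictor_index(directory_id, address, address_bits=4, block_offset_bits=6):
    """Form a predictor index from a directory id and a few address bits.

    Assumes 2**block_offset_bits bytes per block and takes the low-order
    `address_bits` bits of the block address (an illustrative choice).
    """
    block_address = address >> block_offset_bits
    selected_bits = block_address & ((1 << address_bits) - 1)
    return (directory_id, selected_bits)


# Two addresses that differ only above the selected bits share one history.
first = predictor_index(directory_id=2, address=0x1234)
second = predictor_index(directory_id=2, address=0x1234 + (1 << 10))
print(first, second)
```

Because only four address bits take part in the index, each directory keeps at most sixteen histories in this sketch, far fewer than one per block.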
[0041] FIG. 6 shows a set of six tables listing examples of
predictors in accordance with the invention and corresponding
performance simulation results. The predictors shown are based on
various combinations of one or more of directory (dir), address
(add), processor node (pid) and program counter (pc) information.
The predictor names are of the form
prediction-function(index)^depth, where prediction-function
indicates the function used to update the predictor, index
indicates the particular combination of directory, address,
processor node and program counter information used by the
predictor, and depth is the history depth. In the case of address
(add) or program counter (pc) information, the corresponding
identifier includes a subscript indicating the corresponding
number of bits of information.
[0042] The predictors shown in FIG. 6 are also clas- 
sified as either direct or forwarded to indicate the par- 
ticular update mechanism used. In a direct update 
mechanism, each time a data block is written, the set of 
invalidated true readers are used as history to generate 
the new prediction. In a forwarded update mechanism, 
when a writer invalidates a set of readers associated 
with another node, it forwards this history to the appro- 
priate predictor entry so it can be used by the correct 
writer. Forwarded update thus requires last-writer infor- 
mation for each data block so invalidated readers can 
be associated with a specific writer. Tables 1 , 2 and 3 of 
FIG. 6 list predictors which utilize a direct update mech- 
anism, while Tables 4, 5 and 6 list predictors which utilize 
a forwarded update mechanism. 
[0043] By way of example, the predictor union(pid+dir+add_4)^4 in
Table 6 represents a prediction scheme using direct update,
indexing its prediction state using the processor number,
directory node, and four bits of data-block address, and unioning
the last two sharing bitmaps to predict the next one for each
index. As another example, a last-bitmap predictor indexed by
directory node and eight bits of address information may be
denoted union(dir+add_8)^1 or inter(dir+add_8)^1, depending on
the particular function used. It should be noted that a
union-based or intersection-based predictor with a history depth
of one is the same as a last-bitmap predictor.

[0044] Additional details regarding these and other aspects of
predictors are described in S. Kaxiras and C. Young, "Coherence
Communication Prediction in Shared Memory Multiprocessors,"
Proceedings of the 6th Annual IEEE Symposium on High-Performance
Computer Architecture (HPCA), January 2000, which is incorporated
by reference herein.

[0045] For each of the example predictors shown in FIG. 6, a
number of performance parameters are listed. These include
predictor size, sensitivity, specificity, predictive value of a
positive test (PVP) and predictive value of a negative test
(PVN).
[0046] The predictor size is measured as log2 of the number of
bits utilized by the predictor.
[0047] Sensitivity is the ratio of correct predictions to the sum
of correct predictions and omitted predictions, and indicates how
well the predictor predicts sharing when sharing will indeed take
place. A sensitive predictor is good at finding and exploiting
opportunities for sharing, while an insensitive predictor misses
many opportunities.
[0048] Specificity is the ratio of avoided predictions to the sum
of avoided predictions and extra predictions, and indicates the
likelihood that no resources will be wasted on unshared data.
[0049] PVP is the ratio of correct predictions to the sum of
correct and extra predictions, and provides an indication of the
percentage of useful data-forwarding traffic out of all
data-forwarding traffic.
[0050] PVN is the ratio of avoided predictions to the 
sum of avoided predictions and omitted predictions, and 
provides an indication as to how likely an unshared 
block is to be correctly predicted not to be shared. 
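
The four ratios defined above follow directly from counts of correct, omitted, extra and avoided predictions. The helper below is a minimal sketch; the function name `prediction_metrics` and the sample counts are hypothetical and are not taken from the simulation results of FIG. 6.

```python
def prediction_metrics(correct, omitted, extra, avoided):
    """Compute sensitivity, specificity, PVP and PVN from prediction counts.

    correct: sharing predicted and it occurred
    omitted: sharing occurred but was not predicted
    extra:   sharing predicted but it did not occur
    avoided: sharing neither predicted nor occurred
    """
    return {
        "sensitivity": correct / (correct + omitted),
        "specificity": avoided / (avoided + extra),
        "pvp": correct / (correct + extra),
        "pvn": avoided / (avoided + omitted),
    }


# Made-up counts, used only to exercise the formulas.
metrics = prediction_metrics(correct=30, omitted=10, extra=20, avoided=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In these terms an intersection predictor tends to trade sensitivity for PVP, since it makes fewer predictions overall, while a union predictor does the opposite.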
[0051] Tables 1 and 4 show the ten predictors with the
highest PVPs under direct update and forwarded update,
respectively, of a set of possible predictors for
which performance was simulated. All of the predictors
in this group are deep-history intersection predictors,
which will maximize PVP by speculating only on very
stable sharing relationships. Two of the top-ten
schemes are common to the two tables. It can be seen
that direct update and forwarded update have very little
influence on PVP. However, the forwarded schemes on
average are more sensitive. None of the high-PVP
schemes is sensitive compared to a last-bitmap or
union-predictor scheme. This means that they will generate
very productive traffic, but they will miss many
opportunities for sharing.
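The contrast between intersection and union predictors can be illustrated with a small Python sketch. It assumes, for illustration only, that each set of readers is represented as a set of node numbers and that the prediction function is applied to the current set together with the earlier sets; the function name `predict_readers` and the sample reader sets are hypothetical.

```python
from functools import reduce


def predict_readers(current, history, function="union"):
    """Combine the current reader set with earlier reader sets."""
    sets = [frozenset(current)] + [frozenset(h) for h in history]
    if function == "union":
        # Any node seen in any of the sets is predicted to read again.
        return reduce(frozenset.union, sets)
    if function == "intersection":
        # Only nodes present in every set are predicted to read again.
        return reduce(frozenset.intersection, sets)
    raise ValueError(f"unknown function: {function}")


# Hypothetical reader sets: nodes 1 and 2 share the block consistently,
# while nodes 5 and 7 appear only occasionally.
current = {1, 2, 5}
history = [{1, 2}, {1, 2, 7}, {1, 2, 5}]

union_pred = predict_readers(current, history, "union")         # {1, 2, 5, 7}
inter_pred = predict_readers(current, history, "intersection")  # {1, 2}
```

In this example the intersection predictor forwards only to the two stable sharers, which favors PVP, whereas the union predictor also reaches the occasional readers, which favors sensitivity at the cost of extra traffic.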

[0052] Table 2 shows the ten most sensitive schemes
in the set of possible predictors using direct update. All
are union schemes with the maximum history depth
used in this example, i.e., a history depth of 4. All
schemes are roughly comparable in sensitivity, but with
different values of PVP. It is interesting to note that by
far the least expensive scheme (union(dir+add2)4) is
fifth-best overall in terms of sensitivity.

[0053] Table 3 shows the ten most sensitive schemes
in the set of possible predictors using forwarded update.
There is very little difference between the direct- and
forwarded-update schemes. Six of the top ten schemes are
common to the two lists, and the statistics differ little
from column to column.

[0054] Tables 5 and 6 show the top ten predictors in 
the set of possible forwarded update predictors in terms 
of specificity and sensitivity, respectively. 
[0055] It should again be emphasized that the predic- 
tors shown in FIG. 6 are examples only, and the inven- 
tion may be implemented using other types of predic- 
tors. For example, although the maximum history depth 
in the FIG. 6 examples is four, other predictors may use 
greater history depths. 

[0056] The present invention may be configured to 
meet the requirements of a variety of different process- 
ing applications and environments, using any desired 
types and arrangements of processors. The above-de- 
scribed embodiments of the invention are therefore in- 
tended to be illustrative only. Numerous alternative em- 
bodiments within the scope of the following claims will 
be apparent to those skilled in the art. 



Claims 

1. A method of determining a set of predicted readers 
of a data block in a multiprocessor system, the 
method comprising the steps of: 

determining a current set of readers of a data 
block which is subject to a write request; and 
generating the set of predicted readers based 
on the current set of readers and at least one 
additional set of readers representative of at 
least a portion of a global history of a directory 
associated with the data block. 

2. The method of claim 1 wherein the generating step 
further comprises the step of applying a function to 
the current set of readers and at least one additional 
set of readers. 

3. The method of claim 2 wherein the function com- 
prises at least one of a union function, an intersec- 
tion function and a pattern-based function. 

4. The method of claim 1 wherein the directory and the 
data block comprise elements of a memory associ- 
ated with a processor node of the multiprocessor 
system. 



5. The method of claim 1 wherein the global history of
the directory comprises a plurality of sets of previous
readers processed by the directory, the total number of
the plurality of sets of previous readers corresponding
to a designated history depth associated with generation
of the set of predicted readers.

6. The method of claim 5 wherein the global history is
maintained in a shift register having a number of storage
locations corresponding to the designated history depth.

7. The method of claim 6 wherein the history depth is
greater than two.

8. The method of claim 1 wherein each reader in the
system maintains an accessed bit for each of a plurality
of data blocks, the accessed bit of a particular reader
for a given data block indicating whether the particular
reader has actually read the given data block.

9. The method of claim 8 wherein accessed bit information
is sent by the particular reader to the directory in
conjunction with an invalidation acknowledgment.

10. The method of claim 1 wherein after the requested
write on the data block is completed, the resulting data
block is sent to each of the readers in the set of
predicted readers.

11. The method of claim 1 wherein the generating step
utilizes a direct update mechanism.

12. The method of claim 1 wherein the generating step
utilizes a forwarded update mechanism.

13. The method of claim 2 wherein the function is
selected dynamically.

14. The method of claim 13 wherein the function is
selected on a per-program basis, such that each of a
plurality of programs running in the multiprocessor
system can independently determine the function to be
applied to determine the set of predicted readers.

15. The method of claim 13 wherein the function is
selected under program control and can be selected at
run time by a given program running on the
multiprocessor system.

16. The method of claim 13 wherein the function is
selected on a per-page basis, such that the function
applied can be determined independently for each of a
plurality of memory pages, each of which may be
comprised of one or more data blocks.

17. The method of claim 13 wherein the function is
selected based at least in part on information regarding
network utilization.

18. The method of claim 1 wherein the generating step 
further includes utilizing information regarding the 
global history of the directory in conjunction with at 
least a subset of cache address information asso- 
ciated with one or more of the readers to determine 
the set of predicted readers. 

19. The method of claim 1 wherein the generating step
further includes utilizing information regarding the
global history of the directory in conjunction with
processor node information associated with one or
more of the readers to determine the set of predicted
readers.

20. The method of claim 1 wherein the generating step
further includes utilizing information regarding the
global history of the directory in conjunction with
program counter information associated with one or
more of the readers to determine the set of predicted
readers.

21. The method of claim 1 wherein each of at least a 
subset of the readers corresponds to a particular 
processor node in the multiprocessor system. 

22. An apparatus for determining a set of predicted 
readers of a data block in a multiprocessor system, 
the apparatus comprising: 

a processor node operative to determine a 
current set of readers of a data block which is sub- 
ject to a write request, and to implement a prediction 
mechanism which generates the set of predicted 
readers based on the current set of readers and at 
least one additional set of readers representative of 
at least a portion of a global history of a directory 
associated with the data block. 

23. The apparatus of claim 22 wherein the set of pre- 
dicted readers is generated at least in part by ap- 
plying a function to the current set of readers and 
at least one additional set of readers. 

24. The apparatus of claim 23 wherein the function 
comprises at least one of a union function, an inter- 
section function and a pattern-based function. 

25. The apparatus of claim 22 wherein the directory and 
the data block comprise elements of a memory as- 
sociated with the processor node. 

26. The apparatus of claim 22 wherein the global history
of the directory comprises a plurality of sets of
previous readers processed by the directory, the total
number of the plurality of sets of previous readers
corresponding to a designated history depth associated
with generation of the set of predicted readers.

27. The apparatus of claim 26 wherein the global history
is maintained in a shift register having a number of
storage locations corresponding to the designated
history depth.

28. A multiprocessor system comprising:

a plurality of processor nodes, at least a given
one of the processor nodes being operative to determine
a current set of readers of a data block which is
subject to a write request, the given processor node
implementing a prediction mechanism which generates a
set of predicted readers of the data block based on the
current set of readers and at least one additional set
of readers representative of at least a portion of a
global history of a directory associated with the data
block.

29. The system of claim 28 wherein the set of predicted
readers is generated at least in part by applying a
function to the current set of readers and at least
one additional set of readers.

30. The system of claim 29 wherein the function
comprises at least one of a union function and an
intersection function.

31. The system of claim 28 wherein the directory and
the data block comprise elements of a memory associated
with the given processor node.

32. The system of claim 28 wherein the global history
of the directory comprises a plurality of sets of
previous readers processed by the directory, the total
number of the plurality of sets of previous readers
corresponding to a designated history depth associated
with generation of the set of predicted readers.

33. The system of claim 32 wherein the global history
of the directory is maintained in a shift register
associated with a corresponding one of the processor
nodes and having a number of storage locations
corresponding to the designated history depth.

34. The system of claim 28 wherein each of at least a 
subset of the readers corresponds to a particular 
one of the processor nodes in the multiprocessor 
system. 
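The shift-register history and the dynamically selected function recited in the claims above can be sketched as follows. This is an illustrative Python model under stated assumptions, not the claimed apparatus: the class name `DirectoryHistory`, its methods, and the choice to store the previous reader sets separately from the current set are all hypothetical.

```python
from collections import deque


class DirectoryHistory:
    """Software model of a fixed-depth history of previous reader sets."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("history depth must be at least 1")
        # A bounded deque behaves like a shift register: pushing a new
        # entry at the left drops the oldest entry at the right.
        self._sets = deque(maxlen=depth)

    def record(self, readers):
        """Shift a newly observed reader set into the history."""
        self._sets.appendleft(frozenset(readers))

    def contents(self):
        """Return the stored reader sets, most recent first."""
        return list(self._sets)

    def predict(self, current, combine):
        """Apply the caller-selected function to current set and history."""
        result = frozenset(current)
        for previous in self._sets:
            result = combine(result, previous)
        return result


# Hypothetical usage with a history depth of 3.
hist = DirectoryHistory(depth=3)
for readers in ({0, 1}, {1, 2}, {1, 3}, {1, 4}):
    hist.record(readers)  # the oldest set, {0, 1}, is shifted out

stable = hist.predict({1, 5}, frozenset.intersection)  # {1}
broad = hist.predict({1, 5}, frozenset.union)          # {1, 2, 3, 4, 5}
```

Passing the combining function at prediction time is one simple way to model a function that is selected dynamically, for example per program or per page; other selection mechanisms are equally possible.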

FIG. 3

FIG. 5

FIG. 6



TABLE 1: TOP 10 PVP DIRECT UPDATE

SCHEME                   SIZE  SENS.  SPEC.  PVP   PVN
inter(pid+add^)          14    0.30   1.00   0.92  0.93
inter(pid+add8)^         18    0.31   1.00   0.92  0.93
inter(pid+add8)2         18    0.35   1.00   0.91  0.93
inter(pid+pc^+add^)4     18    0.35   1.00   0.90  0.93
inter(pid+add^           14    0.34   1.00   0.89  0.93
inter(pid+dir+add^)4     18    0.32   1.00   0.88  0.93
inter(pid+pc^+add^)^     18    0.37   1.00   0.88  0.93
inter(pid)4              10    0.29   1.00   0.88  0.92
inter(pid+pc^)^          14    0.37   0.99   0.87  0.93
inter(pid+pc8)^          18    0.39   0.99   0.86  0.93


TABLE 2: TOP 10 SPEC. DIRECT UPDATE

SCHEME                   SIZE  SENS.  SPEC.  PVP   PVN
inter(pc4+add8)4         18    0.12   1.00   0.72  0.91
inter(add4)4             10    0.07   1.00   0.78  0.91
inter(pid+add8)4         18    0.31   1.00   0.92  0.93
inter()4                 8     0.03   1.00   0.73  0.90
inter(add8)4             14    0.11   1.00   0.70  0.91
inter(pc^+add^)4         14    0.08   1.00   0.74  0.91
inter(pc8+add^)4         18    0.09   1.00   0.74  0.91
inter(pc^)4              10    0.04   1.00   0.71  0.90
inter(pc8)4              14    0.04   1.00   0.71  0.90
inter(pc^+dir+add^)4     18    0.20   1.00   0.73  0.92




TABLE 3: TOP 10 SENS. DIRECT UPDATE

SCHEME                   SIZE  SENS.  SPEC.  PVP   PVN
union(dir+add^)^         14    0.67   0.86   0.40  0.95
union(dir)4              10    0.67   0.85   0.39  0.95
union(pc4+dir)4          14    0.67   0.85   0.40  0.95
union(pc^+dir+add^)^     18    0.67   0.85   0.40  0.95
union(dir+add8)4         18    0.67   0.87   0.42  0.95
union(pc8+dir)4          18    0.66   0.86   0.42  0.95
union(pid+add^)^         14    0.66   0.87   0.41  0.95
union(pid+add8)^         18    0.66   0.87   0.42  0.95
union(add^)^             18    0.66   0.86   0.40  0.95
union(add^)^             10    0.66   0.80   0.29  0.95



TABLE 4: TOP 10 PVP FORWARDED UPDATE

SCHEME                   SIZE  SENS.  SPEC.  PVP   PVN
inter(pid+pc^+add^       18    0.34   1.00   0.92  0.93
inter(pid+add^           14    0.33   1.00   0.91  0.93
inter(pid+add8)4         18    0.33   1.00   0.91  0.93
inter(pid+dir+add^)^     18    0.34   1.00   0.91  0.93
inter(pid+add8)3         18    0.37   1.00   0.90  0.93
inter(pid+add^           14    0.36   1.00   0.88  0.93
inter(pid+pc^+add^)^     18    0.37   1.00   0.88  0.93
inter(pid+dir)4          14    0.38   0.99   0.88  0.93
inter(pid+dir+add^)3     18    0.37   1.00   0.88  0.93
inter(pid)^              10    0.33   1.00   0.87  0.93


TABLE 5: TOP 10 SPEC. FORWARDED UPDATE

SCHEME                   SIZE  SENS.  SPEC.  PVP   PVN
inter(pc^+add8)4         18    0.11   1.00   0.69  0.91
inter(add4)4             10    0.07   1.00   0.78  0.91
inter()4                 6     0.03   1.00   0.73  0.90
inter(add8)4             14    0.11   1.00   0.70  0.91
inter(pid+add8)4         18    0.33   1.00   0.91  0.93
inter(pc8+add^           18    0.09   1.00   0.74  0.91
inter(pc^+add^)^         14    0.07   1.00   0.69  0.91
inter(pid+pc^+add^)^     18    0.34   1.00   0.92  0.93
inter(pc^+dir+add^       18    0.18   1.00   0.73  0.91
inter(pc^                20    0.03   1.00   0.61  0.90




TABLE 6: TOP 10 SENS. FORWARDED UPDATE

SCHEME                   SIZE  SENS.  SPEC.  PVP   PVN
union(dir+add^)^         14    0.67   0.86   0.40  0.95
union(dir)*              10    0.67   0.85   0.39  0.95
union(dir+add8)^         18    0.67   0.87   0.42  0.95
union(pc+dir+add^)^      18    0.66   0.89   0.45  0.95
union(add^)^             18    0.66   0.86   0.40  0.95
union(pid+dir)*          14    0.66   0.89   0.45  0.95
union(add^)^             10    0.66   0.80   0.29  0.95
union()4                 6     0.65   0.77   0.25  0.95
union(add8)^             14    0.65   0.84   0.35  0.95
union(pid+add8)4         18    0.65   0.88   0.44  0.95